Volumes of logistic regression models with applications to model selection
Abstract
Logistic regression models with n observations and q linearly independent covariates are shown to have Fisher information volumes which are bounded below by $\pi$ and above by $\binom{n}{q}\pi$. This is proved with a novel generalisation of the classical theorems of Pythagoras and de Gua, which is of independent interest. The finding that the volume is always finite is new, and it implies that the volume can be directly interpreted as a measure of model complexity. The volume is shown to be a continuous function of the design matrix X at generic X, but to be discontinuous in general. This means that models with sparse design matrices can be significantly less complex than nearby models, so the resulting model-selection criterion prefers sparse models. This is analogous to the way that $\ell_1$-regularisation tends to prefer sparse model fits, though in our case this behaviour arises spontaneously from general principles. Lastly, an unusual topological duality is shown to exist between the ideal boundaries of the natural and expectation parameter spaces of logistic regression models.

1 Overview and context of results

Any full-rank, q×n matrix X with q ≤ n is the design matrix of a unique logistic regression model SX for binary data y ∈ {0, 1}^n [17]. Here, the n components of y are considered to be draws from n independent Bernoulli random variables and we are using the canonical link function. When equipped with the Fisher information metric, the q-dimensional parameter space of SX becomes a Riemannian manifold [16]. Further, by Chentsov's theorem [8, 2], the Fisher information metric is the only natural metric on SX, in the sense that it is the only metric which is invariant under natural statistical transformations related to sufficient statistics. The geometry of SX is therefore likely to be important and useful in understanding the behaviour of SX. In this paper, we concentrate on the simplest geometric invariant of SX, namely its volume Vol(SX).
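Concretely, for the canonical link the Fisher information metric of SX has a simple closed form: writing $p_i = \sigma(x_i^\top\beta)$ for the i-th fitted probability, $I(\beta) = X W X^\top$ where $W = \mathrm{diag}(p_i(1-p_i))$. The following minimal sketch (the function names are ours, not the paper's) computes this matrix for a q×n design matrix:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fisher_information(X, beta):
    """Fisher information I(beta) = X W X^T for a q x n design matrix X,
    where W = diag(p_i (1 - p_i)) and p_i = sigmoid(x_i . beta)."""
    p = sigmoid(X.T @ beta)   # n fitted probabilities
    W = p * (1.0 - p)         # diagonal Bernoulli variances
    return (X * W) @ X.T      # q x q information matrix

# Example with q = 2 covariates, n = 4 observations
X = np.array([[1.0, 1.0, 1.0, 1.0],
              [0.5, -1.0, 2.0, 0.0]])
I = fisher_information(X, np.zeros(2))
# At beta = 0 every p_i = 1/2, so W = 1/4 and I = X X^T / 4
assert np.allclose(I, X @ X.T / 4)
```

The Riemannian volume Vol(SX) discussed below is the integral of $\sqrt{\det I(\beta)}$ over the parameter space.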
We show that Vol(SX) is always finite, which was previously unknown, and we prove the following bounds.

Theorem 1. $\pi \le \mathrm{Vol}(S_X) \le \binom{n}{q}\pi$.

These bounds are based on Theorem 9, which is a novel generalisation of the classical theorems of Pythagoras and de Gua [29, p. 207] and is of independent interest.

arXiv:1408.0881v3 [math.ST] 17 Oct 2014

Our result that Vol(SX) is finite has a number of theoretical consequences for the logistic regression model SX, since it shows that SX satisfies the common regularity condition that its Jeffreys prior should be proper. One consequence of this is that Vol(SX) can be directly interpreted as a measure of model complexity, since a simple, monotonic function of Vol(SX) then approximates the parametric complexity for large n [23][13, eqn. 2.21]. Here, the parametric complexity is an information-theoretic measure of the statistical size of SX which can be subtracted from the maximized log-likelihood to give a natural measure of the parsimony of SX as a model for data y [13, eqn. 2.20]. The corresponding model-selection criterion is known as the minimum description length (MDL) criterion [5, 25] and it has many desirable properties, such as almost-sure consistency for parametric models and the ability to select a data-generating model from a countable set of models for all sufficiently large n with probability 1 [4].

No previous logistic regression studies have used the volume as a measure of model complexity, though a few studies have used other variants of MDL: [14] used a mixture MDL approach [15] in which a normal prior was placed on the regression coefficients and MDL principles were used to choose the hyper-parameters; [31] and [20] were based on the approximation of [21] and its 2-part code approach; and [10] used a renormalized NML criterion [24] adapted from linear regression to logistic regression with a weighting method.
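The bounds of Theorem 1 can be checked numerically in the single-covariate case q = 1, where $\mathrm{Vol}(S_X) = \int_{\mathbb{R}} \sqrt{I(\beta)}\,d\beta$ reduces to a one-dimensional quadrature. A small sketch (the function name `volume_q1` is ours; only NumPy/SciPy are assumed):

```python
import numpy as np
from scipy.integrate import quad
from math import comb, pi

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def volume_q1(x):
    """Fisher information volume for q = 1: Vol = int sqrt(I(beta)) dbeta,
    where I(beta) = sum_i x_i^2 p_i (1 - p_i) and p_i = sigmoid(x_i * beta)."""
    x = np.asarray(x, dtype=float)
    def root_info(beta):
        p = sigmoid(x * beta)
        return np.sqrt(np.sum(x ** 2 * p * (1.0 - p)))
    val, _ = quad(root_info, -np.inf, np.inf)
    return val

# n = q = 1: the lower bound pi is attained exactly,
# since int sqrt(p(1-p)) dbeta = int_0^1 dp / sqrt(p(1-p)) = pi
assert abs(volume_q1([1.0]) - pi) < 1e-6

# n = 3, q = 1: Theorem 1 gives pi <= Vol <= C(3,1) * pi
v = volume_q1([1.0, 2.0, 0.5])
assert pi - 1e-9 <= v <= comb(3, 1) * pi
```

The integrand decays exponentially in |β|, so the improper integral converges and the quadrature is well behaved.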
The above connections with MDL show that Vol(SX) is an important measure of model complexity, but we also show that it has some remarkable geometric properties. Perhaps the strangest and most useful property is that Vol(SX) is a discontinuous function of X. Some design matrices, such as those with some rows consisting only of zeroes, are significantly less complex than nearby design matrices. This means that a model-selection criterion based on Vol(SX) will tend to choose models with sparse design matrices over models with design matrices with many small entries. This behaviour is analogous to (though different from) the way that $\ell_1$-regularised regression models tend to choose model fits with coefficients equal to 0 over model fits with small coefficients [27, 28].

We derive an approximation to Vol(SX) under the mild assumptions that n is large, the rows of X are realisations of independent and identically distributed (IID) random variables and X has full rank with probability 1, plus a more technical condition on the covariate distribution (see Section 6.2). This approximation to Vol(SX) then gives the following model-selection criterion.

Definition 1 (Approximate volume criterion). Given a countable set of competing logistic regression models for binary data y ∈ {0, 1}^n with n observations, the approximate volume criterion advocates choosing the model SX with the smallest value of

$$-\log p(y \mid \hat\beta(y)) + \frac{q}{2}\log\frac{\pi}{2} + \frac{1}{2}\log\binom{n - n_0}{q} \qquad (1)$$

where $\log p(y \mid \hat\beta(y))$ is the maximized log-likelihood and the design matrix of SX has n rows, q columns and exactly n0 rows with all entries equal to 0.

The main result of [20] implies that this criterion is strongly consistent, meaning that it will select the correct model almost surely as the sample size n goes to infinity. As a proof of principle, we apply this model-selection criterion to a simulated image processing problem, giving promising results (see Figure 2).
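Criterion (1) is simple to implement directly. The sketch below (the function names are ours; the MLE is obtained with SciPy's BFGS, and X is stored as n × q as in Definition 1) scores a candidate design matrix:

```python
import numpy as np
from scipy.optimize import minimize
from math import comb, log, pi

def neg_log_lik(beta, X, y):
    # Canonical logit link: -log p(y|beta) = sum_i [log(1 + e^{eta_i}) - y_i eta_i]
    eta = X @ beta
    return float(np.sum(np.logaddexp(0.0, eta) - y * eta))

def approx_volume_criterion(X, y):
    """Criterion (1): -log p(y|beta_hat) + (q/2) log(pi/2)
    + (1/2) log C(n - n0, q), with n0 = number of all-zero rows of X."""
    n, q = X.shape
    n0 = int(np.sum(~X.any(axis=1)))  # rows with all entries equal to 0
    res = minimize(neg_log_lik, np.zeros(q), args=(X, y), method="BFGS")
    return res.fun + 0.5 * q * log(pi / 2) + 0.5 * log(comb(n - n0, q))

# Toy use: score two nested candidate models on the same simulated data
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = (rng.random(50) < 1.0 / (1.0 + np.exp(-X[:, 0]))).astype(float)
scores = [approx_volume_criterion(X[:, :1], y), approx_volume_criterion(X, y)]
```

The model with the smaller score is preferred; note how the $\frac{1}{2}\log\binom{n-n_0}{q}$ term shrinks when X has all-zero rows, which is the sparsity preference described above.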
Our approach to this problem couples the approximate volume criterion with $\ell_1$-regularisation [27, 28], making our results applicable to the case q > n, where the number of potential covariates is larger than the number of observations (see Section 6.4).

Lastly, we consider the behaviour of the logistic regression model SX for large parameter values when X is generic, meaning that any q of the rows of X are linearly independent. We first show that, while Vol(SX) is a discontinuous function of X in general, it is continuous at generic X. This raises the possibility that a closed-form expression for Vol(SX) might exist for generic X. We then consider the relationship between two natural polygonal decompositions of the ideal boundaries of the natural and expectation parameter spaces.